(S, C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases

نویسندگان

  • Nieves R. Brisaboa
  • Antonio Fariña
  • Gonzalo Navarro
  • María F. Esteller
چکیده

This work presents (s, c)-Dense Code, a new method for compressing natural language texts. This technique is a generalization of a previous compression technique called End-Tagged Dense Code that obtains better compression ratio as well as a simpler and faster encoding than Tagged Huffman. At the same time, (s, c)-Dense Code is a prefix code that maintains the most interesting features of Tagged Huffman Code with respect to direct search on the compressed text. (s, c)-Dense Coding retains all the efficiency and simplicity of Tagged Huffman, and improves its compression ratios. We formally describe the (s, c)-Dense Code and show how to compute the parameters s and c that optimize the compression for a specific corpus. Our empirical results show that (s, c)-Dense Code improves End-Tagged Dense Code and Tagged Huffman Code, and reaches only 0.5% overhead over plain Huffman Code.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

New adaptive compressors for natural language text

Semistatic byte-oriented word-based compression codes have been shown to be an attractive alternative to compress natural language text databases, because of the combination of speed, effectiveness, and direct searchability they offer. In particular, our recently proposed family of dense compression codes has been shown to be superior to the more traditional byte-oriented word-based Huffman cod...

متن کامل

Boosting Text Compression with Word-Based Statistical Encoding

Semistatic word-based byte-oriented compressors are known to be attractive alternatives to compress natural language texts. With compression ratios around 30-35%, they allow fast direct searching of compressed text. In this article we reveal that these compressors have even more benefits. We show that most of the state-of-the-art compressors benefit from compressing not the original text, but t...

متن کامل

Simple, Fast, and Efficient Natural Language Adaptive Compression

One of the most successful natural language compression methods is word-based Huffman. However, such a two-pass semi-static compressor is not well suited to many interesting real-time transmission scenarios. A one-pass adaptive variant of Huffman exists, but it is character-oriented and rather complex. In this paper we implement word-based adaptive Huffman compression, showing that it obtains v...

متن کامل

Fast and Flexible Compression for Web Search Engines

In this paper we present the adaptation of a compression technique, specially designed to compress large textual databases, to the peculiarities of web search engines. The (s,c)-Dense Code belongs to a new category of compression techniques [17,9] that allows fast and flexible search directly on compressed files. However these methods are only suitable for large natural texts containing at leas...

متن کامل

Efficient Data Compression Scheme using Dynamic Huffman Code Applied on Arabic Language

The development of an efficient compression scheme to process the Arabic language represents a difficult task. This paper employs the dynamic Huffman coding on data compression with variable length bit coding, on the Arabic language. Experimental tests have been performed on both Arabic and English text. A comparison is made to measure the efficiency of compressing data results on both Arabic a...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003